Advanced Statistical Inference I

Chelsea Parlett-Pelleriti

Inference Review

Bayes vs. Freq

  • Bayesians think parameters (\(\theta\)) are random (think multiverse) and data are fixed
    • Bayesians make probability statements about parameters
    • Bayesians imagine what values parameters might take on in different universes
    • Bayesians use probability distributions (priors and posteriors) to describe the uncertainty they have about what a parameter value might be

Bayes vs. Freq

  • Frequentists think data are random (think repeated sampling) and parameters (\(\theta\)) are fixed
    • Frequentists make probability statements about sample statistics (mean, CI)
    • Frequentists imagine what our estimates might look like under repeated samples from the same population
    • Frequentists care about long run error rates (e.g. \(\alpha\) which is our Type I error rate in a NHST test)

Parameter Estimation vs. Hypothesis Testing

  • Parameter Estimation asks “What is the value of this parameter?”
    • Homing in on a value
  • Hypothesis Testing asks “Which hypothesis is supported?”
    • making a discrete decision (e.g. reject, fail to reject)

Frequentist PE and HT

  • Frequentist PE: Confidence Intervals, Standard Errors, and Test Statistics
  • Frequentist HT (NHST): p-values, \(\alpha\), statistical power

Bayesian PE and HT

  • Bayesian PE: Posterior Means/Medians, Credible Intervals (HDI, ETI), Posterior Summaries (e.g. \(p(x > 0)\))
  • Bayesian HT: Posterior Odds, Bayes Factors, Credible Interval Overlap

Bayesian Regression Models

Regression (GLM) Model Review

\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ y | \mathbf{X} \sim \pi\left (\mu,...\right ) \]

  • link function: \(g()\)

  • likelihood function: \(\pi()\)

Linear Regression Review

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \\ \mu_i = \mathbf{X_i}\beta = \beta_0 + \beta_1 \cdot x_{i1} + ... + \beta_n \cdot x_{in} \]

Bayesian Linear Regression

\[ y_i \sim \mathcal{N}(\mu_i, \color{red}\sigma^2) \\ \mu_i = \mathbf{X_i}\color{red}\beta = \color{red}\beta_0 + \color{red}\beta_1 \cdot x_{i1} + ... + \color{red}\beta_n \cdot x_{in} \]

Setting Priors

Choosing BLR Priors

Let’s look at a regression model that uses age to predict income (in thousands).

\[ \underbrace{\mu_i}_{\mathbb{E}(\text{income})} = \color{red}{\beta_0}+ \color{red}{\beta_1} \cdot age_i \\ \text{income}_i \sim \mathcal{N}(\mu_i, \color{red}{\sigma^2}) \]

\[ \beta_0 \sim \mathcal{N}(m_0, s_0^2); \beta_1 \sim \mathcal{N}(m_1, s_1^2);\sigma \sim \Gamma(0.1,10) \]

❓ What other distributions might be reasonable for our two \(\beta\)s?

Choosing BLR Priors

\(\beta_0 \sim \mathcal{N}(0, 1); \beta_1 \sim \mathcal{N}(2, 4);\sigma \sim \Gamma(0.1,10)\)

We put priors on (at least) all parameters in our model.
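Before fitting, it can help to see what these priors imply about the data: draw parameters from the priors and simulate incomes. A minimal prior predictive simulation (sketched in Python for illustration, although the slides use R/brms; I'm assuming the Gamma prior is in shape–rate form, as in Stan):

```python
import numpy as np

rng = np.random.default_rng(540)
n_draws = 10_000
age = 30  # one representative age

# draw parameters from the priors above
beta0 = rng.normal(0, 1, n_draws)        # beta_0 ~ N(0, 1)
beta1 = rng.normal(2, 2, n_draws)        # beta_1 ~ N(2, 4), so sd = 2
sigma = rng.gamma(0.1, 1 / 10, n_draws)  # sigma ~ Gamma(0.1, 10), shape-rate

# prior predictive draws of income (in thousands) at that age
mu = beta0 + beta1 * age
income = rng.normal(mu, sigma)

print(income.mean())  # roughly 60 = 0 + 2 * 30
```

If the simulated incomes look absurd (negative, or in the millions), that is a sign the priors need rethinking before we ever touch the data.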

Uniform Priors

If we do not set priors, most software will assume default priors. In some cases, this means Uninformative Uniform Priors.

Uniform Priors

\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]

If your prior is uniform, it applies the same multiplier to each possible value of \(\theta\) (your given parameter). So…

\[ \text{Posterior} \propto \text{Likelihood} \]
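A quick numerical check of this proportionality on a grid of \(\theta\) values (a Python sketch with made-up Binomial data):

```python
import numpy as np

# made-up data: 7 successes in 10 trials
k, n = 7, 10
theta = np.linspace(0.001, 0.999, 999)        # grid of parameter values
likelihood = theta**k * (1 - theta)**(n - k)  # Binomial likelihood (up to a constant)

prior = np.ones_like(theta)      # uniform prior: same multiplier everywhere
posterior = likelihood * prior
posterior /= posterior.sum()     # normalize over the grid

# the posterior is exactly the normalized likelihood
print(np.allclose(posterior, likelihood / likelihood.sum()))  # True
```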

Uniform Priors

Pros:

  • “Objective” 🙄

  • lets the “data speak for themselves”

  • easy to explain

  • sometimes gives results roughly equivalent to Frequentist estimates

Cons:

  • you usually know something about the parameter, and a flat prior ignores it (non-regularizing)

  • non-invariance under re-parameterization

  • improper over \(\mathbb{R}^n\) (does not integrate to 1)

Uniform Priors (Freq)

library(brms)
# freq
lm(y ~ x, data = df) |> summary()

Call:
lm(formula = y ~ x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.11461 -0.74886  0.00784  0.63176  2.12521 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.11802    0.09295   1.270    0.207    
x           -0.47841    0.08475  -5.645 1.61e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9294 on 98 degrees of freedom
Multiple R-squared:  0.2454,    Adjusted R-squared:  0.2377 
F-statistic: 31.86 on 1 and 98 DF,  p-value: 1.61e-07

Uniform Priors (Bayes)

# bayes
priors <- c(
  prior("", class = "b"),
  prior("", class = "Intercept"),
  prior("uniform(0, 1e6)", class = "sigma")
)
brm(y ~ x, data = df, prior = priors, verbose = 2) |> summary()
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ x 
   Data: df (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.12      0.09    -0.07     0.31 1.00     3583     2754
x            -0.48      0.09    -0.64    -0.31 1.00     3460     2741

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.94      0.07     0.82     1.09 1.00     4224     3164

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Uniform Priors (Non-Invariance)

Consider a logistic regression:

\[ \underbrace{\mathbb{E}\left( y | \mathbf{X} \right)}_\text{pred prob} = \frac{1}{1 + e^{-\mathbf{X}\beta}} \]

  • Parameterization 1: estimate log odds coefficients \(\beta\)s

  • Parameterization 2: estimate odds coefficients \(e^{\beta}\)

Uniform Priors (Non-Invariance)

Parameterization 1

Let’s feign ignorance…we know nothing about what \(\beta\)s should be, so we place a uniform prior on the coefficients.

Uniform Priors (Non-Invariance)

Parameterization 2

By the same logic…we know nothing about what \(e^{\beta}\)s should be, so we place a uniform prior on the coefficients.

Uniform Priors (Non-Invariance)

First, let’s look at what Parameterization 1 (uniform prior on \(\beta\)) implies about \(e^\beta\):

\[ p_{\beta}(\beta) = \frac{1}{upper - lower} = \frac{1}{b-a} \]

let \(\theta = e^\beta\) so \(\beta = \ln(\theta)\):

\[ \underbrace{p_\theta(\theta) = p_\beta(\beta) \cdot \lvert\frac{d \beta}{d \theta}\rvert}_\text{change of variable} \\ p_\theta(\theta) = \frac{1}{b-a} \cdot \lvert\frac{1}{\theta}\rvert = \frac{1}{(b-a)\theta} \]

❗ this does not give all \(\theta\)s the same likelihood!!!
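This implied density is easy to verify by simulation: draw \(\beta\) uniformly, exponentiate, and compare a histogram of \(\theta = e^\beta\) to \(\frac{1}{(b-a)\theta}\). A Python sketch with arbitrary bounds:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.0, 3.0                      # arbitrary bounds for the uniform prior
beta = rng.uniform(a, b, 1_000_000)  # beta ~ Uniform(a, b)
theta = np.exp(beta)                 # theta = e^beta

# empirical density of theta from a histogram
counts, edges = np.histogram(theta, bins=200, density=True)
mids = (edges[:-1] + edges[1:]) / 2

# density implied by the change of variables: 1 / ((b - a) * theta)
implied = 1 / ((b - a) * mids)

# the histogram tracks 1/((b-a)theta), not a flat line
print(np.max(np.abs(counts - implied)) < 0.05)  # True
```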

Uniform Priors (Non-Invariance)

Now, let’s look at what Parameterization 2 (uniform prior on \(e^\beta\)) implies about \(\beta\):

\[ p_{\theta}(\theta) = \frac{1}{upper - lower} = \frac{1}{b-a} \]

let \(\beta = \ln(\theta)\) so \(\theta = e^\beta\):

\[ \underbrace{p_\beta(\beta) = p_\theta(\theta) \cdot \lvert\frac{d \theta}{d \beta}\rvert}_\text{change of variable} \\ p_\beta(\beta) = \frac{1}{b-a} \cdot \lvert e^\beta\rvert = \frac{e^\beta}{(b-a)} \]

Jeffreys Prior

\[ p(\theta) \propto \sqrt{I(\theta)} \]

  • Jeffreys Prior: a method of constructing non-informative priors that is invariant under re-parameterization

(see Bayesian Data Analysis 2.8 for more math)
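As a worked example (not from the slides): for a single Bernoulli observation with parameter \(\theta\), the Fisher information is \(I(\theta) = \frac{1}{\theta(1-\theta)}\), so

\[ p(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2} \]

which is a \(\text{Beta}(\frac{1}{2}, \frac{1}{2})\) distribution. Deriving the Jeffreys prior in any other parameterization (e.g. the log odds) yields a prior consistent with this one under the change of variables, which is exactly the invariance the uniform prior lacks.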

Fisher Information

Slightly Mathy Idea: Fisher Information measures how sensitive the log-likelihood function \(\ell(\theta | X)\) is to changes in \(\theta\) (more sensitive \(\to\) more information)

Fisher Information

Math Idea:

\[ I_X(\theta) = -\mathbb{E}\left[\frac{\partial^2 \ell(\theta | X)}{\partial\theta^2} \right] \]

where \(\ell(\theta | X)\) is the log-likelihood of \(\theta\) given \(X\). If \(\ell\) is sensitive to changes in \(\theta\), the second derivative will be large in magnitude and we expect to see high information.

Note: \(\ell\) is usually concave down around the maximum likelihood estimate, so the second derivative is negative; the negative sign in front of the expectation makes the information positive.
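A small numerical illustration (a Python sketch): for a Normal likelihood with known \(\sigma\), the Fisher information for \(\mu\) is \(I(\mu) = n/\sigma^2\), and a finite-difference second derivative of \(\ell\) at the MLE recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 2.0
x = rng.normal(1.0, sigma, n)  # data from N(1, sigma^2), sigma known

def loglik(mu):
    # Gaussian log-likelihood in mu, dropping terms that don't involve mu
    return -np.sum((x - mu) ** 2) / (2 * sigma**2)

# finite-difference second derivative at the MLE (the sample mean)
mu_hat, h = x.mean(), 1e-3
second = (loglik(mu_hat + h) - 2 * loglik(mu_hat) + loglik(mu_hat - h)) / h**2

# -second matches the theoretical information n / sigma^2 = 12.5
print(-second, n / sigma**2)
```

Because \(\ell\) is exactly quadratic in \(\mu\) here, the finite difference agrees with \(n/\sigma^2\) up to floating-point error.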

Uniform Priors (Regularization)

Regularization (in general): discourages overfitting, makes models simpler

Regularizing Priors: keep parameter values in a “reasonable range”

Uniform Priors (Regularization)

Ridge = Normal Prior, Lasso = Laplace Prior.
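A one-line sketch of the Ridge connection (writing \(\tau^2\) for the prior variance on each coefficient): maximizing the posterior is the same as minimizing a penalized least-squares objective,

\[ -\log p(\beta | y) \propto \underbrace{\frac{1}{2\sigma^2}\lVert y - \mathbf{X}\beta \rVert^2}_{-\log \text{likelihood}} + \underbrace{\frac{1}{2\tau^2}\lVert \beta \rVert^2}_{-\log \text{prior}} \]

which is Ridge regression with \(\lambda = \sigma^2/\tau^2\); a Laplace prior contributes a \(\lvert\beta\rvert\) penalty instead, giving the Lasso.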

Other Types of Priors

  • Uninformative: all possible values for \(\theta\) have the same relative prior likelihood

  • Weakly Informative: regularize our estimates, extreme values are less likely, but not much other info

  • Strongly Informative: prior is narrow around known likely values

When would you use each type of prior?

Choosing Prior Distributions

  • think about the range of possible values of \(\theta\)

  • think about amount of prior uncertainty

  • think about regularization

In reality, we often report 2-3 different versions of the models with various priors to see how these priors change the inferences we make.

Setting Priors in brms

set.seed(540)
n <- 100
x <- rnorm(n)
y <- 0.25 - x*0.5 + rnorm(n)
df <- data.frame(x,y)

# bayes
priors <- c(
  prior("normal(0,5)", class = "b"),
  prior("normal(0,4)", class = "Intercept"),
  prior("gamma(0.1, 10)", class = "sigma")
)
brm(y ~ x, data = df, prior = priors, verbose = 2) |> summary()

CHECKING DATA AND PREPROCESSING FOR MODEL 'anon_model' NOW.

COMPILING MODEL 'anon_model' NOW.

STARTING SAMPLER FOR MODEL 'anon_model' NOW.

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.008 seconds (Warm-up)
Chain 1:                0.007 seconds (Sampling)
Chain 1:                0.015 seconds (Total)
Chain 1: 

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ x 
   Data: df (Number of observations: 100) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.25      0.09     0.07     0.44 1.00     3900     3130
x            -0.49      0.10    -0.70    -0.29 1.00     3793     2966

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.98      0.07     0.86     1.12 1.00     3827     2876

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).